DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Neural Information Processing Systems

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy that differentiably prunes a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens remain hardware friendly, which makes it easy for our framework to achieve actual speed-ups. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%–37% and improves throughput by over 40%, while the accuracy drop stays within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
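The attention-masking idea — blocking a pruned token from serving as a key so that kept tokens never attend to it — can be sketched in NumPy. This is a simplified single-head version with identity Q/K/V projections for illustration; the function name and shapes are assumptions, not the authors' code:

```python
import numpy as np

def masked_self_attention(x, keep_mask):
    """Single-head self-attention where pruned tokens (keep_mask == 0)
    are blocked as keys, so kept tokens never attend to them.
    x: (n, d) token features; keep_mask: (n,) of 0/1 floats."""
    d = x.shape[-1]
    q, k, v = x, x, x  # identity projections, for illustration only
    scores = q @ k.T / np.sqrt(d)
    # Block interactions with pruned tokens: their key columns get -1e9,
    # which vanishes under softmax.
    scores = np.where(keep_mask[None, :] > 0, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

Because pruning is expressed as a soft mask inside the softmax rather than an actual drop, gradients still flow through the kept tokens during training, while at inference the masked tokens can simply be removed.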



Importance Ranking in Complex Networks via Influence-aware Causal Node Embedding

Gao, Jiahui, Zhou, Kuang, Zhu, Yuchen, Wu, Keyu

arXiv.org Artificial Intelligence

Understanding and quantifying node importance is a fundamental problem in network science and engineering, underpinning a wide range of applications such as influence maximization, social recommendation, and network dismantling. Prior research often relies on centrality measures or advanced graph embedding techniques using structural information, followed by downstream classification or regression tasks to identify critical nodes. However, these methods typically decouple node representation learning from the ranking objective and rely on the topological structure of target networks, leading to feature-task inconsistency and limited generalization across networks. This paper proposes a novel framework that leverages causal representation learning to obtain robust, invariant node embeddings for cross-network ranking tasks. First, we introduce an influence-aware causal node embedding module within an autoencoder architecture to extract node embeddings that are causally related to node importance. Moreover, we introduce a causal ranking loss and design a unified optimization framework that jointly optimizes the reconstruction and ranking objectives, enabling mutual reinforcement between node representation learning and ranking optimization. This design allows the proposed model to be trained on synthetic networks and to generalize effectively across diverse real-world networks. Extensive experiments on multiple benchmark datasets demonstrate that the proposed model consistently outperforms state-of-the-art baselines in terms of both ranking accuracy and cross-network transferability, offering new insights for network analysis and engineering applications, particularly in scenarios where the target network's structure is inaccessible in advance due to privacy or security constraints.
Complex networks provide a powerful framework for modeling and analyzing a wide range of systems across diverse domains, including social networks, transportation systems, and biological networks [1]. In these networks, nodes represent entities within a real system such as individuals, infrastructure components, or functional units, while edges capture interactions or relationships between them. A key challenge in network science and engineering is identifying important nodes, as they play pivotal roles in maintaining network functionality, performance, stability, and robustness [2].


VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Martins, Stephane Da Silva, Aldea, Emanuel, Hégarat-Mascle, Sylvie Le

arXiv.org Artificial Intelligence

Multi-agent trajectory prediction is a key task in computer vision for autonomous systems, particularly in dense and interactive environments. Existing methods often struggle to jointly model goal-driven behavior and complex social dynamics, which leads to unrealistic predictions. In this paper, we introduce VISTA, a recursive goal-conditioned transformer architecture that features (1) a cross-attention fusion mechanism to integrate long-term goals with past trajectories, (2) a social-token attention module enabling fine-grained interaction modeling across agents, and (3) pairwise attention maps that reveal social influence patterns during inference. Our model extends the single-agent goal-conditioned approach into a cohesive multi-agent forecasting framework. In addition to the standard evaluation metrics, we also consider trajectory collision rates, which capture the realism of the joint predictions. Evaluated on the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy with improved interaction modeling. On MADRAS, our approach reduces the average collision rate of strong baselines from 2.14% to 0.03%, and on SDD, it achieves a 0% collision rate while outperforming SOTA models in terms of ADE/FDE and minFDE. These results highlight the model's ability to generate socially compliant, goal-aware, and interpretable trajectory predictions, making it well-suited for deployment in safety-critical autonomous systems.
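The cross-attention fusion component — trajectory embeddings attending to candidate goal embeddings — can be sketched in NumPy. The identity projections, residual fusion, and multi-goal setup here are illustrative assumptions, not VISTA's actual layers:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def goal_conditioned_cross_attention(traj, goals):
    """Fuse long-term goal information into a past-trajectory encoding.
    traj:  (t, d) encoded past positions (queries)
    goals: (g, d) candidate goal embeddings (keys/values)
    Identity Q/K/V projections used for illustration only."""
    d = traj.shape[-1]
    attn = softmax(traj @ goals.T / np.sqrt(d), axis=-1)  # (t, g)
    return traj + attn @ goals  # residual cross-attention fusion
```

Each timestep of the past trajectory is re-weighted by its affinity to the candidate goals, so later decoding steps condition on where the agent is likely headed rather than on motion history alone.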